Collocations as Word Co-occurrence Restriction Data - An Application to Japanese Word Processor
Abstract
Collocations, combinations of specific words, are useful linguistic resources for NLP in general. The purpose of this paper is to show their usefulness through an application to the Kanji character decision process in Japanese word processors. Unlike recent attempts at automatic extraction, our collocations were collected manually through many years of intensive corpus investigation. Our collection procedure consists of (1) finding a proper combination of words in a corpus and (2) recalling similar combinations of words prompted by it. This procedure, which depends on human judgment and on enriching the data by association, is effective in remedying the data sparseness problem, although some arbitrariness of human judgment is inevitable. Approximately 72,400 collocations were used as word co-occurrence restriction data for deciding Kanji characters in Japanese word processors. Experiments have shown that the collocation data yield a Kana-to-Kanji character conversion accuracy 8.9% higher than that of a system that uses no collocation data and 7.0% higher than that of a commercial word processor of average performance.
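As a rough illustration of how collocations can act as word co-occurrence restrictions during Kana-to-Kanji conversion, the sketch below picks, among the homophone candidates for a reading, the one that forms a known collocation with a word in the context. This is not the authors' system: the tables `HOMOPHONES` and `COLLOCATIONS`, the function `choose_kanji`, and the fallback to the first candidate are all assumptions made for the example, and the entries are toy data rather than items from the paper's collection.

```python
# Toy data, assumed for illustration only (not from the paper's collection).
# Kana reading -> homophone candidates, in default priority order.
HOMOPHONES = {
    "かう": ["買う", "飼う"],  # "to buy" / "to keep (an animal)"
}

# Collocations stored as (context word, candidate word) pairs.
COLLOCATIONS = {
    ("本", "買う"),  # book + buy
    ("犬", "飼う"),  # dog + keep
}


def choose_kanji(reading, context_words):
    """Return the homophone candidate licensed by a collocation, if any."""
    # Unknown readings are passed through unchanged.
    candidates = HOMOPHONES.get(reading, [reading])
    for candidate in candidates:
        # Accept the first candidate that co-occurs with any context word.
        if any((word, candidate) in COLLOCATIONS for word in context_words):
            return candidate
    # No collocation matched: fall back to the default (first) candidate.
    return candidates[0]
```

For instance, `choose_kanji("かう", ["犬"])` selects 飼う, while the same reading with 本 in the context selects 買う. A real converter would combine such restrictions with dictionary lookup and other ranking evidence.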
Similar resources
Extracting Bilingual Collocations from Non-Aligned Parallel Corpora
This paper proposes a new method to find correspondences of uninterrupted collocations from Japanese-English bilingual corpora without sentence-to-sentence alignment. Uninterrupted collocations in English such as “once again”, “give up”, or “gross national product” handled as a single word or a compound word in Japanese, can be automatically extracted with corresponding Japanese words using wor...
Large Scale Collocation Data and Their Application to Japanese Word Processor Technology
Word processors or computers used in Japan employ Japanese input method through keyboard stroke combined with Kana (phonetic) character to Kanji (ideographic, Chinese) character conversion technology. The key factor of Kana-to-Kanji conversion technology is how to raise the accuracy of the conversion through the homophone processing, since we have so many homophonic Kanjis. In this paper, we re...
Intellectual structure of knowledge in Nanomedicine field (2009 to 2018): A Co-Word Analysis
Introduction: The Co-word analysis has the ability to identify the intellectual structure of knowledge in a research domain and reveal its subsurface research aspects. Objective: This study examines the intellectual structure of knowledge in the field of nanomedicine during the period of 2009 to 2018 by using Co-word analysis. Materials and Methods: This paper develops a sciento...
Survey of Word Co-occurrence Measures for Collocation Detection
This paper presents a detailed survey of word co-occurrence measures used in natural language processing. Word co-occurrence information is vital for accurate computational text treatment, it is important to distinguish words which can combine freely with other words from other words whose preferences to generate phrases are restricted. The latter words together with their typical co-occurring ...
Integrating Morphology With Multi-Word Expression Processing In Turkish
This paper describes a multi-word expression processor for preprocessing Turkish text for various language engineering applications. In addition to the fairly standard set of lexicalized collocations and multi-word expressions such as named-entities, Turkish uses a quite wide range of semi-lexicalized and non-lexicalized collocations. After an overview of relevant aspects of Turkish, we present...